simple linear regression model \[
\text{hwy} \approx \beta_0 + \beta_1 \text{cty}
\]
residuals
least-square estimates
parameter interpretation
model comparison with \(R^2\)
outliers
Outline
multiple linear regression
feature engineering
model comparison
predictive performance
Multiple linear regression
Remember, to improve our initial model with \((\beta_0, \beta_1) = (1, 1.3)\), we could (i) find better estimates, (ii) use additional predictors
for (i), the least-square estimates are usually very good
for (ii), we use a multiple linear regression model
For instance, to predict hwy we could use more variables than just cty.
The mpg data set
d <- ggplot2::mpg
head(d, n = 4)
# A tibble: 4 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa~
2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa~
3 audi a4 2 2008 4 manual(m6) f 20 31 p compa~
4 audi a4 2 2008 4 auto(av) f 21 30 p compa~
Instead of fitting a model with only cty or only displ, we could fit a model with both predictors!
Linear regression with 2 predictors
The model equation is \[
\text{hwy} \approx \beta_0 + \beta_1 \text{cty} + \beta_2 \text{displ}
\]
We can find the least-square estimates (minimizing the SSR [SSE]) with the command lm in R.
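As a sketch (assuming the mpg data are loaded as d, as above), the fit looks like:

```r
# Fit a linear regression of hwy on both cty and displ;
# lm() computes the least-square estimates of beta_0, beta_1, beta_2
m2 <- lm(hwy ~ cty + displ, data = d)
coef(m2)
```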
In a regression model, categorical predictors are represented using indicator variables.
To represent a categorical predictor with \(k\) levels, we use \((k-1)\) indicator variables.
Including drv
For instance, the categorical variable drv has \(k=3\) levels (4, f and r), so we can represent it with 2 indicator variables with the following model equation
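In R, lm() creates these indicator variables automatically when a factor or character predictor appears in the formula; a minimal sketch (again assuming d is the mpg tibble):

```r
# drv has 3 levels ("4", "f", "r"); R represents it with 2 indicators,
# taking the first level ("4") as the baseline
m3 <- lm(hwy ~ cty + displ + drv, data = d)
names(coef(m3))  # the indicators appear as "drvf" and "drvr"
```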
Unsurprisingly, including additional predictors makes the regression line closer to the points \(\Rightarrow\) residuals are smaller \(\Rightarrow\) SSR is smaller \(\Rightarrow\)\(R^2\) is larger.
Fitting a larger model
Actually, I include all predictors (except for model).
m_larger <- lm(hwy ~ manufacturer + displ + year + cyl + trans + drv + cty + fl + class, data = d)
Thanks to the additional predictors, the residuals are very small, making \(R^2\) close to \(1\).
glance(m_larger)$r.squared
[1] 0.9773386
We will see in the next lecture that this is not always a good sign.
Group exercise - multiple linear regression
Exercises 8.9
In addition,
fit the model in R
identify the type of each variable
identify the baseline level of the categorical predictors
library(openintro)
d_birth <- openintro::births14
Statistics as an art – feature engineering
We saw that adding predictors to the model seems to help.
However, the variables included in the data set, e.g. displ, year, etc., may not be the most useful predictors for hwy.
Feature engineering refers to the creation of new predictors from existing ones.
This is where your understanding of the data, scientific knowledge, etc., makes a big difference.
Transforming a variable
Consider the predictor displ
ggplot(d) + geom_point(aes(displ, hwy))
The relation between displ and hwy is not exactly linear.
Let us include the predictor \(\dfrac{1}{\text{displ}}\) to capture this nonlinear relation.
The model equation is \[
\text{hwy} \approx \beta_0+ \beta_1 \text{displ} + \beta_2 \dfrac{1}{\text{displ}}
\] The least-square coefficient estimates are
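A sketch of this fit in R, using I() so that 1/displ is treated as arithmetic rather than formula syntax (assuming d is the mpg tibble):

```r
# Add the transformed predictor 1/displ to capture the nonlinear relation
m_inv <- lm(hwy ~ displ + I(1 / displ), data = d)
coef(m_inv)  # beta_0, beta_1, beta_2
```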
We simply need to create a new variable corresponding to \(\left(\dfrac{\text{Girth}}{2}\right)^2 \times \text{Height}\). Note that I first need to transform the variable Girth into feet to ensure that all variables have the same unit.
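A sketch of this construction, assuming the built-in trees data set (Girth is in inches, Height in feet; the column name cyl_vol is my own):

```r
d_trees <- datasets::trees
# Convert Girth from inches to feet, then build (Girth/2)^2 * Height
d_trees$cyl_vol <- (d_trees$Girth / 12 / 2)^2 * d_trees$Height
# Use the engineered feature as a predictor of Volume
m_trees <- lm(Volume ~ cyl_vol, data = d_trees)
coef(m_trees)
```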
where \(\beta_1\) indicates the effect of an additional mile on the duration and \(\beta_2\) the effect of bad weather (probably positive).
Note that the effect of weather is fixed in this model, say "\(+5\) minutes".
Is this reasonable? No!
The effect of weather should vary with distance. For shorter races, bad weather may add only 2 or 3 minutes, while for longer races, bad weather may increase the average duration by 10 or 15 minutes.
We capture this expected pattern using an interaction term: \[
\text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \text{bad} + \beta_3 \text{bad} \times \text{distance},
\] where \(\text{bad}\) is the indicator of bad weather (\(1\) if bad, \(0\) if good).
When the weather is good, the equation simplifies to \[
\text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \cdot 0 + \beta_3 \cdot 0 \cdot \text{distance} = \beta_0 + \beta_1 \text{distance}
\]
When the weather is bad, the equation simplifies to \[
\text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \cdot 1 + \beta_3 \cdot 1 \cdot \text{distance} = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \text{distance}
\]
The new slope is \(\beta_1+\beta_3\), meaning that the effect of an additional mile on the average duration is \(\beta_1+\beta_3\) (not \(\beta_3\)).
The effect of distance varies depending on the weather; the two variables interact.
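In R, the * operator in a formula adds both main effects and their interaction. A sketch on simulated data (the data frame races and its columns are hypothetical, built only for illustration):

```r
# Simulate race durations where bad weather changes the slope
set.seed(1)
races <- data.frame(
  distance    = runif(100, 3, 26),          # miles
  bad_weather = rbinom(100, 1, 0.3)         # indicator: 1 = bad
)
races$duration <- 10 + 8 * races$distance + 2 * races$bad_weather +
  0.5 * races$distance * races$bad_weather + rnorm(100)

# distance * bad_weather expands to
# distance + bad_weather + distance:bad_weather,
# i.e. beta_1, beta_2 and the interaction beta_3
m_int <- lm(duration ~ distance * bad_weather, data = races)
coef(m_int)
```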
Recap
simple linear regression model \[
Y \approx \beta_0 + \beta_1 X
\]
multiple linear regression model \[
Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p
\]